Exploiting Text Structure for Topic Identification

Authors

  • Tadashi Nomoto
  • Yuji Matsumoto
Abstract

The paper demonstrates how information on text structure can improve the identification of topical words in texts, a task approached here with a probabilistic model of text categorization. The texts used are not explicitly structured. Instead, a text structure is identified by measuring the similarity between the segments that make up the text and its title. A structure identified this way is shown to give a good clue to which parts of the text are most relevant to its content. The significance of exploiting structural information for topic identification is demonstrated by a set of experiments on 19 MB of Japanese newspaper articles. The paper also brings concepts from Rhetorical Structure Theory (RST) to the statistical analysis of text structure. Finally, it is shown that information on text structure is more effective for large documents than for small ones.
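
The idea of scoring text segments by their similarity to the title can be illustrated with a minimal sketch. The abstract does not say which similarity measure or tokenization the authors use, so everything below is an assumption made for illustration: cosine similarity over raw term counts, a naive regex tokenizer for English, and the hypothetical helper names `term_vector`, `cosine`, and `rank_segments_by_title`. The paper's experiments are on Japanese newspaper text, which would need a morphological analyzer in place of the regex.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts (assumed representation, English-only tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_segments_by_title(title, segments):
    """Return (score, segment_index) pairs, most title-like segment first."""
    title_vec = term_vector(title)
    scored = [(cosine(title_vec, term_vector(seg)), i)
              for i, seg in enumerate(segments)]
    # Sort by descending score; ties keep the original segment order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    title = "Bank raises interest rates"
    segments = [
        "The central bank raises interest rates again.",
        "Weather was sunny over the weekend.",
        "Analysts expect rates to stay high.",
    ]
    for score, index in rank_segments_by_title(title, segments):
        print(f"segment {index}: {score:.3f}")
```

In this sketch, segments that share more vocabulary with the title rank higher, and a topic-identification step could then weight words from the top-ranked segments more heavily. How the paper actually combines segment scores with its probabilistic model is not stated in the abstract.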

Similar resources

A review of text mining approaches and their function in discovering and extracting a topic

Background and aim: Four text mining methods are examined and focused on understanding and identifying their properties and limitations in subject discovery. Methodology: The study is an analytical review of the literature of text mining and topic modeling. Findings: LSA could be used to classify specific and unique topics in documents that address only a single topic. The other three text min...

Topic Segmentation Using Markov Models on Section Level

Topic segmentation, i.e. the combined task of document segmentation and topic identification, is an interesting issue both from a theoretical point of view as well as for practical applications. Previous studies have mainly focussed on applications exposing rather weak correlations regarding the topic order (e.g. Broadcast News). In this work, we concentrate on documents following a typical str...

Topic Analysis Using a Finite Mixture Model

We address the issue of 'topic analysis,' by which is determined a text's topic structure, which indicates what topics are included in a text, and how topics change within the text. We propose a novel approach to this issue, one based on statistical modeling and learning. We represent topics by means of word clusters, and employ a finite mixture model to represent a word distribution within a t...

D-VITA: A Visual Interactive Text Analysis System Using Dynamic Topic Mining

Recent developments in web technologies like Web 2.0 have led to the generation of massive amounts of data. The rapid growth of data makes knowledge extraction and trend prediction a challenging task. A recent approach for the unsupervised analysis of text corpora is dynamic topic mining. While there is a growing interest in using this technique, interactive analysis systems for dynamic topic m...

Damage identification of structures using experimental modal analysis and continuous wavelet transform

Abstract: Modal analysis is a powerful technique for understanding the behavior and performance of structures. Modal analysis can be conducted via artificial excitation, e.g. shaker or instrument hammer excitation. Input force and output responses are measured. That is normally referred to as experimental modal analysis (EMA). EMA consists of three steps: data acquisition, system identificatio...

Publication date: 1996